data <- read_data()
# data <- read_data("../data/lgd_dataset.csv")
data$appartment_collateral_mv <- ifelse(
data$real_estate_type == "appartment",
data$mortgage_collateral_mv,
0
)
data$house_collateral_mv <- ifelse(
data$real_estate_type == "single family house",
data$mortgage_collateral_mv,
0
)
data$office_collateral_mv <- ifelse(
data$real_estate_type == "office building",
data$mortgage_collateral_mv,
0
)
data$single_family_house <- ifelse(
data$real_estate_type == "single family house",
1,
0
)
data$lgd_nom <- data$lgd * data$loan_amount
data$lgd_log <- log(data$lgd+1)
data$house_collateral_ratio <- data$house_collateral_mv / data$loan_amount
data$appartment_collateral_ratio <- data$appartment_collateral_mv / data$loan_amount
data$office_collateral_ratio <- data$office_collateral_mv / data$loan_amount
data$additional_collateral_ratio <- data$additional_collateral_mv / data$loan_amount
data$loan_amount_log <- log(data$loan_amount)
data$house_collateral_log <- log(data$house_collateral_mv+1)
data$appartment_collateral_log <- log(data$appartment_collateral_mv+1)
data$office_collateral_log <- log(data$office_collateral_mv+1)
data$additional_collateral_log <- log(data$additional_collateral_mv+1)
data$rr <- 1 - data$lgd
data$rr_nom <- data$rr * data$loan_amount
data$rr_nom_log <- log(data$rr_nom)
data$loan_to_mortg <- data$loan_amount / data$mortgage_collateral_mv
data$loan_to_addit <- data$loan_amount / data$additional_collateral_mv
data$loan_to_colla <- data$loan_amount / (data$mortgage_collateral_mv + data$additional_collateral_mv)
It makes intuitively sense that the dynamics with which loans are provided, repaid and the overall riskyness of a loan is dependent on the loan type. For example, higher risk might be associated with a loan which is used by a business to enter a new market compared to the buying of a house by a private person. Looking at the distribution of the recovery in CHF makes the difference visible (using logs for better visibility):
Therefore, we might want to fit one model for private customers and one model for corporate customers separately.
Note that using some model (e.g. linear regression) we might predict negative values for loss given default which - at least in this context - does not make sense. Therefore, after prediction the estimates which are negative are set to zero.
First, let’s subset the dataset and only focus on private clients. Note that the observed loss given default seems to depend on the type of real estate:
Economic intuition: selling conditions for real estate might differ across real estate types. Single family houses might to be more difficult to sell - or more generally can be sold in case of default with a higher loss.
For this reason, the market value per real estate type is treated differently, i.e. for each real estate category one column is added containing the market value if the corresponding contract is on the given real estate type and zero otherwise, same of additional collateral:
| real_estate_type | house_collateral_mv | appartment_collateral_mv | |
|---|---|---|---|
| 736 | single family house | 1636145 | 0.0 |
| 827 | single family house | 1766339 | 0.0 |
| 719 | single family house | 1521278 | 0.0 |
| 274 | appartment | 0 | 1444293.6 |
| 648 | single family house | 1784238 | 0.0 |
| 519 | appartment | 0 | 920868.9 |
If we consider nominal values, there should be a positive relationship between the market values of the collateral types and the nominal loss given default (resp. the recovery in CHF):
Hence, the market values could be a good predictor of the loss given default in CHF, which then can be converted to the loss given default in % by dividing with the loan amount.
The first model uses the corresponding market values of the collateral to predict the loss given default in CHF:
\[ \text{nominal LGD}_i = \beta_0 + \beta_1*\text{MV appartment}_i + \beta_2*\text{MV house}_i + \beta_3*\text{MV additional}_i + \epsilon_i \]
Summary of this model:
Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)
Residuals:
Min 1Q Median 3Q Max
-248109 -19027 -5268 3369 611394
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6620.18795 4670.52353 -1.417 0.15673
loan_amount 0.22713 0.05127 4.430 0.0000107 ***
appartment_collateral_mv -0.16534 0.04010 -4.123 0.0000411 ***
house_collateral_mv -0.13249 0.04015 -3.300 0.00101 **
additional_collateral_mv -0.09204 0.05156 -1.785 0.07461 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 68200 on 837 degrees of freedom
Multiple R-squared: 0.2508, Adjusted R-squared: 0.2472
F-statistic: 70.06 on 4 and 837 DF, p-value: < 0.00000000000000022
This models RMSE score is 47349.3 CHF resp. 4.79%.
Following plot shows the difference between the realized LGD values and the corresponding predictions:
Note that the predictions for single family houses (on the RHS of vertical line) are systematically worse than for apartments:
We might to look deeper into this difference by real estate types.
Overall, we estimate a loss of 23.07 mio. CHF while the observed loss is 22.64 mio. CHF, hence we overestimate the loss by 0.43 mio. CHF what we might prefer from a risk management point of view (compared to underestimate the loss).
Note the distribution of the target variable
Maybe converting the target variable to logs and hence also the
predictors such that they have similar scaling might lead to better
performance. Note that the target variable includes a lot of zeros,
which result in -Inf when converting to logs which cannot
be used for regression. Therefore, the loss given default is converted
to the recovery rate resp. the recovery in CHF before taking the log,
leading to following distribution:
Unfortunately, there is no economic valid way to transform the market value of collaterals such that we do not have any zeros anymore. Therefore, one CHF is added to all market values, resulting in a negligible difference if the market value is positive as the values are quite large and zero for no collateral since \(log(1) = 0\).
The model which is fitted now:
\[ \text{log(nominal recovery)}_i = \beta_0 + \beta_1*\text{log(MV appartment)}_i + \beta_2*\text{log(MV house)}_i + \beta_3*\text{log(MV additional)}_i + \epsilon_i \]
Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)
Residuals:
Min 1Q Median 3Q Max
-0.34719 0.00238 0.01203 0.01947 0.07834
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0127381 0.0480673 0.265 0.791070
loan_amount_log 0.8353316 0.0441020 18.941 < 0.0000000000000002 ***
appartment_collateral_log 0.1597596 0.0437136 3.655 0.000274 ***
house_collateral_log 0.1568866 0.0437039 3.590 0.000350 ***
additional_collateral_log 0.0012379 0.0005057 2.448 0.014565 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.054 on 837 degrees of freedom
Multiple R-squared: 0.9912, Adjusted R-squared: 0.9912
F-statistic: 2.365e+04 on 4 and 837 DF, p-value: < 0.00000000000000022
[1] -1209921
[1] "we expect 23.851894932812 mio CHF loss. Real loss was 22.6419735018877"
Overall, we estimate a loss of 23.85 mio. CHF while the observed loss is 22.64 mio. CHF, hence we overestimate the loss by 1.21 mio. CHF what we might prefer from a risk management point of view (compared to underestimate the loss). However, the first model also overestimates the loss but is still more accurate. Therefore, we might prefer model 1 over model 2.
Note that the response variable LGD cannot be negative and therefore is bounded. There might be an underlying linear model but the observed LGD values are truncated:
\[ \text{LGD}_i = \max(\widehat{LGD}_i, 0) \]
Therefore, let’s try tobit regression
Call:
AER::tobit(formula = lgd ~ appartment_collateral_ratio + house_collateral_ratio +
additional_collateral_ratio, left = 0, right = 1, dist = "logistic",
data = segment_private)
Observations:
Total Left-censored Uncensored Right-censored
842 617 225 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.93431 0.14793 6.316 0.00000000026894
appartment_collateral_ratio -0.81429 0.11954 -6.812 0.00000000000962
house_collateral_ratio -0.72907 0.11746 -6.207 0.00000000053971
additional_collateral_ratio -0.78708 0.14184 -5.549 0.00000002873278
Log(scale) -2.74016 0.06004 -45.639 < 0.0000000000000002
(Intercept) ***
appartment_collateral_ratio ***
house_collateral_ratio ***
additional_collateral_ratio ***
Log(scale) ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Scale: 0.06456
Logistic distribution
Number of Newton-Raphson Iterations: 4
Log-likelihood: -85.26 on 5 Df
Wald-statistic: 100.8 on 3 Df, p-value: < 0.000000000000000222
Note: Tobit predicts what the result ought to be in the absence of the truncation. Therefore, convert negative estimates to 0.
Although the tobit model seems to predict more accurately, the problem is that using the tobit model we estimate a loss of 4.75 mio. CHF while the observed loss is 22.64 mio. CHF, hence we highly underestimate the loss by 17.89 mio.
We might change the lower bound of our estimations, i.e. we do not predict an LGD below a certain value. To still have a data driven decision process, we might set this lower bound to the mean of observed LGD per real estate type:
Now we estimate a loss of 21.79 mio. CHF while the observed loss is 22.64 mio. CHF, hence we still underestimate the loss by 0.85 mio. We might switch back to simple linear regression.
Note that the predicted values for single family house is much worse than for apartments. Therefore, one might fit a model on the subset for apartment and subset for single family house separately:
[1] 67651.11
Still the same problem…
[1] -284405.2
[1] "we expect 22.926378746522 mio CHF loss. Real loss was 22.6419735018877"
We still have the same problem…
First, note that the only real estate type of corporate clients is office building. Therefore, the positions only vary in loan amount, the market value of the collateral and whether the client provides additional collateral and its corresponding market value. Therefore, we need might further logic to segment corporate clients more granular.
Looking at the ratio \(\frac{\text{loan amount}}{\text{collateral}}\), we can see that not all clients must provide proportionally the same value of collateral.
Economic intuition: due to client specific factors, there could be higher or lower risk associated with the position, which require more or less provided collateral.
mean lgd for left tail: 0.0856024
mean lgd for right tail: 0.2684521
mean lgd over whole sample:0.1388319
Given this logic, there should be a positive relationship between loan to collateral ratio and loss given default:
First try simple model of regressing on loan amount and market value of collateral types on LGD in CHF:
\[ \text{nominal LGD}_i = \beta_0 + \beta_1*\text{loan amount}_i + \beta_2*\text{MV office}_i + \beta_3*\text{MV additional}_i + \epsilon_i \]
Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)
Residuals:
Min 1Q Median 3Q Max
-15620458 -1092449 -237547 582744 20755362
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 72141.3778 147286.3091 0.490 0.624
loan_amount 1.7057 0.1513 11.276 <0.0000000000000002
office_collateral_mv -1.2218 0.1174 -10.406 <0.0000000000000002
additional_collateral_mv -1.3694 0.1557 -8.797 <0.0000000000000002
(Intercept)
loan_amount ***
office_collateral_mv ***
additional_collateral_mv ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2732000 on 607 degrees of freedom
Multiple R-squared: 0.4135, Adjusted R-squared: 0.4106
F-statistic: 142.7 on 3 and 607 DF, p-value: < 0.00000000000000022
Root Mean Squared Error:
[1] 0.1558712
The prediction performance is not really sufficient… Let’s analyze if we can find a relationship between the prediction error and some variables.
Look at the relationship between the loan to collateral ratio and the prediction error:
It seems that when increasing the loan to collateral ratio, we tend to overestimate the loss given default. Ideas to address this:
This makes economically sense:
If \(\frac{\text{loan amount}}{\text{collateral}}\) higher, we need less collateral to cover the loan, meaning the corresponding client might be associated with lower risk, hence the expected loss given default should be lower (ceteris paribus). If we do not account for this relationship, we ignore the fact that the client is of lower risk and therefore has a lower loss given default.
Differentiate by collateral type:
Using this logic, let’s first just run a regression of loan to collateral on LGD:
\[ \text{LGD}_i = \beta_0 + \beta_1*\text{Loan to Collateral}_i + \epsilon_i \]
Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)
Residuals:
Min 1Q Median 3Q Max
-0.18465 -0.12691 -0.03866 0.09584 0.51874
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3636 0.1522 -2.389 0.01718 *
loan_to_collateral 0.6397 0.1936 3.304 0.00101 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1484 on 609 degrees of freedom
Multiple R-squared: 0.01761, Adjusted R-squared: 0.016
F-statistic: 10.92 on 1 and 609 DF, p-value: 0.001008
Root Mean Squared Error:
[1] 0.1483979
The fit is not really that much better…
Now “combine” model 1 and 2 by regressing LGD on loan to collateral, office collateral to loan ratio and the additional collateral to loan ratio:
\[ \text{LGD}_i = \beta_0 + \beta_1*\text{Loan to Collateral}_i +\beta_2*\text{MV Office to Loan}_i + \beta_3*\text{MV additional to Loan}_i + \epsilon_i \]
Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)
Residuals:
Min 1Q Median 3Q Max
-0.18796 -0.12687 -0.03173 0.09724 0.50694
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.4743 4.4002 -0.335 0.738
loan_to_collateral 1.3442 2.8354 0.474 0.636
office_collateral_ratio 0.4463 1.7056 0.262 0.794
additional_collateral_ratio 0.2245 1.7204 0.131 0.896
Residual standard error: 0.148 on 607 degrees of freedom
Multiple R-squared: 0.026, Adjusted R-squared: 0.02118
F-statistic: 5.401 on 3 and 607 DF, p-value: 0.001133
Root Mean Squared Error:
[1] 0.1478913
Maybe we get better performance if we further segmentize our sample. Let’s divide our data set by loan amount (high, medium and low loans):
threshold_high <- 0.90
threshold_low <- 0.75
segment_corporate_high <- segment_corporate[segment_corporate$loan_amount >= quantile(segment_corporate$loan_amount, threshold_high),]
segment_corporate_medium <- segment_corporate[segment_corporate$loan_amount >= quantile(segment_corporate$loan_amount, threshold_low) & segment_corporate$loan_amount < quantile(segment_corporate$loan_amount, threshold_high),]
segment_corporate_low <- segment_corporate[segment_corporate$loan_amount < quantile(segment_corporate$loan_amount, threshold_low),]
hist(segment_corporate$loan_amount, breaks = 100)
abline(v = quantile(segment_corporate$loan_amount, threshold_high))
abline(v = quantile(segment_corporate$loan_amount, threshold_low))
Fit model 3 on each subset separately:
\[ \text{LGD}_i = \beta_0 + \beta_1*\text{Loan to Collateral}_i +\beta_2*\text{MV Office to Loan}_i + \beta_3*\text{MV additional to Loan}_i + \epsilon_i \]
| Dependent variable: | |||
| regressors | |||
| highest | high | regular | |
| (1) | (2) | (3) | |
| loan_amount | 2.523*** | 0.171 | 0.157 |
| (0.444) | (0.393) | (0.175) | |
| office_collateral_mv | -1.896*** | 0.023 | 0.013 |
| (0.343) | (0.305) | (0.137) | |
| additional_collateral_mv | -2.024*** | -0.190 | -0.411** |
| (0.457) | (0.405) | (0.173) | |
| Constant | 3,018,068.000 | -1,255,298.000 | -57,557.440 |
| (1,854,473.000) | (2,100,442.000) | (114,372.200) | |
| Observations | 62 | 91 | 458 |
| R2 | 0.378 | 0.047 | 0.226 |
| Adjusted R2 | 0.346 | 0.014 | 0.221 |
| Residual Std. Error | 6,338,295.000 (df = 58) | 3,299,786.000 (df = 87) | 1,207,234.000 (df = 454) |
| F Statistic | 11.745*** (df = 3; 58) | 1.425 (df = 3; 87) | 44.206*** (df = 3; 454) |
| Note: | p<0.1; p<0.05; p<0.01 | ||
RMSE for high segment: 12.97% resp. 8598379.84 CHF
RMSE for medium segment: 15.59% resp. 2770251.67 CHF
RMSE for low segment: 15.04% resp. 1198669.42 CHF
Not sufficient…
Let’s use the logic we derived earlier and segmentize the data set in low, medium and high risk clients:
segment_corporate_high <- segment_corporate[segment_corporate$loan_to_collateral > mu+sigma,]
segment_corporate_medium <- segment_corporate[segment_corporate$loan_to_collateral <= sigma+mu & segment_corporate$loan_to_collateral >= mu-sigma,]
segment_corporate_low <- segment_corporate[segment_corporate$loan_to_collateral < mu-sigma,]
Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)
Residuals:
Min 1Q Median 3Q Max
-6748572 -667852 221447 840999 6984018
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -646387.2240 283041.0449 -2.284 0.0247 *
loan_amount 0.4975 1.2268 0.406 0.6861
office_collateral_mv -0.2078 1.0086 -0.206 0.8373
additional_collateral_mv -1.0614 1.1861 -0.895 0.3733
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1990000 on 91 degrees of freedom
Multiple R-squared: 0.7433, Adjusted R-squared: 0.7348
F-statistic: 87.83 on 3 and 91 DF, p-value: < 0.00000000000000022
RMSE:
[1] 0.1756801
Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)
Residuals:
Min 1Q Median 3Q Max
-14766729 -958151 -180448 676154 19011234
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -55152.3550 187763.9106 -0.294 0.769
loan_amount 2.7166 0.3636 7.471 0.000000000000476 ***
office_collateral_mv -2.0012 0.2856 -7.006 0.000000000010005 ***
additional_collateral_mv -2.2572 0.3062 -7.371 0.000000000000926 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2804000 on 414 degrees of freedom
Multiple R-squared: 0.429, Adjusted R-squared: 0.4248
F-statistic: 103.7 on 3 and 414 DF, p-value: < 0.00000000000000022
RMSE:
[1] 0.152575
Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)
Residuals:
Min 1Q Median 3Q Max
-3999076 -1245613 -875950 54113 10544572
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1065542.4459 326731.7308 3.261 0.00155 **
loan_amount -0.5902 0.5159 -1.144 0.25551
office_collateral_mv 0.4539 0.3839 1.182 0.24008
additional_collateral_mv 0.1962 0.4062 0.483 0.63018
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2411000 on 94 degrees of freedom
Multiple R-squared: 0.04529, Adjusted R-squared: 0.01482
F-statistic: 1.486 on 3 and 94 DF, p-value: 0.2233
RMSE:
[1] 0.2730157